{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d0dba865",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Introduction\n",
    "In active learning (AL), we use a machine learning (ML) model as a surrogate for a more computationally expensive method.  Let's say we want to dock millions of molecules, but have limited computational resources.  We could sample a subset of the molecules we want to dock and dock the subset.  The chemical structures and docking scores for the subset could then be used to build an ML model to predict the docking scores for the entire set of molecules. In AL, we perform multiple cycles of this prediction and sampling workflow. Throughout the process, we want to balance two parameters.\n",
    "- Exploration - efficiently search the chemical space and identify the most promising regions\n",
    "- Explotation - focus the search on the most interesting regions of chemical space\n",
    "\n",
    "This process is illustrated in the figure below.  The two red boxes represent the **oracle** that performs the more expensive calculations.\n",
    "<br>\n",
    "<center><img src=\"images/active_learning.png\" alt=\"active learning figure\" width=\"800\"/></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "145192fd",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "1. Begin with a pool of M molecules\n",
    "2. Sample N molecules from the pool\n",
    "3. Perform the computationally expensive calculations on the subset of N molecules.  If our objective is docking, we dock the N molecules.\n",
    "4. The chemical structures and docking scores for the N molecules are used to build an ML model\n",
    "5. The model from step 4 ised to predict values for the M molecules from step 1\n",
    "6. The predictions from the previous step are used to select another set of N molecules.  There are several ways to do this.  One of the simplest is a **greedy** search where we select the N best scoring molecules.  Alternately, we can employ strategies that use the uncertainty in the predictions to direct exploration.\n",
    "7. Perform the computationally expensive calculations on the molecules selected in step 6. If our objective is docking, we would dock the N molecules.\n",
    "8. The results from step 7 are combined with the results from step 3 and the model is retrained.\n",
    "Steps 4 through 8 can be repeated multiple times\n",
    "\n",
    "In the example below, we use AL to dock a set of 100K molecules.  To make this exercise more time efficient, we'll look up the activity rather performing the docking. The code below uses modAL, an open source library for active learning. Modal provides several different AL strategies."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "329cbaf0",
   "metadata": {},
   "source": [
    "### Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "85af0192",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:03:06.840173Z",
     "start_time": "2025-05-05T17:03:06.838228Z"
    }
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "import sys\n",
    "IN_COLAB = 'google.colab' in sys.modules\n",
    "if IN_COLAB:\n",
    "    !pip install pandas numpy seaborn useful_rdkit_utils tqdm scikit-learn 'modAL-python>=0.4.1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5de68ff8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:03:21.475284Z",
     "start_time": "2025-05-05T17:03:21.472516Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "if IN_COLAB:\n",
    "  import urllib.request\n",
    "\n",
    "  os.makedirs(\"./data\", exist_ok=True)\n",
    "  url = \"https://raw.githubusercontent.com/PatWalters/practical_cheminformatics_tutorials/main/active_learning/data/screen.csv\"\n",
    "  filename = \"data/screen.csv\"\n",
    "  urllib.request.urlretrieve(url,filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f64d67f8",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4cce7404",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:38:53.379241Z",
     "start_time": "2025-05-05T21:38:53.376517Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import useful_rdkit_utils as uru\n",
    "from modAL.models import ActiveLearner\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from tqdm.auto import tqdm\n",
    "from rdkit import rdBase\n",
    "import warnings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ba10e39",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Enable Pandas **progress_apply**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "75cbe78c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:03:32.566341Z",
     "start_time": "2025-05-05T17:03:32.563852Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "tqdm.pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b74fd873",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Read data from [\"Traversing Chemical Space with Active Deep Learning\"](https://chemrxiv.org/engage/chemrxiv/article-details/654a603348dad23120461847) by Derek van Tilborg and Francesca Grisoni. \n",
    "\n",
    "The data is formatted with SMILES and 1 or 0 inidicating active or inactive. \n",
    "```\n",
    "smiles,y\n",
    "COc1cc(/C=N/NC(=O)C(=O)NCC2CCCO2)ccc1O,0\n",
    "CC1CCCCC12NC(=O)N(CC(=O)Nc1ccc(N3CCOCC3)cc1)C2=O,0\n",
    "C[NH+]1CCCC(OC(=O)c2cccc(Cl)c2)C1,0\n",
    "CCOc1ccc(C(=O)NCC(=O)OCC(=O)N2CCCC2)cc1,0\n",
    "```\n",
    "After reading the data w generate fingerprints as descriptors. The function **uru.smi2numpy_fp** takes SMILES as input and returns a fingerprint as a numpy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c4281aa7",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:04:43.999589Z",
     "start_time": "2025-05-05T17:04:35.339921Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5041caacc319471b906cda129c7a9bbf",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/100000 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df = pd.read_csv(\"data/screen.csv\")\n",
    "with rdBase.BlockLogs():\n",
    "    df['fp'] = df.smiles.progress_apply(uru.smi2numpy_fp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd4cfe76-113b-4e03-9db3-c01d207bf519",
   "metadata": {},
   "source": [
    "Look at the activity distribution in the data. We can see that there are ~95K inactives and 5K actives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f89ae4d4-b615-4093-9114-6b02e5b9a6df",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:04:59.870269Z",
     "start_time": "2025-05-05T17:04:59.863418Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "y\n",
       "0    95014\n",
       "1     4986\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.y.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de18c722",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Convert the data to numpy arrays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "119b1ab5",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:05:03.189234Z",
     "start_time": "2025-05-05T17:05:03.124287Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 53.8 ms, sys: 19.7 ms, total: 73.5 ms\n",
      "Wall time: 75.9 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "X_pool = np.stack(df.fp.values)\n",
    "y_pool = df.y.values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d98619a8",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Initial Model\n",
    "Here's where we define an oracle to return the results of our calculation.  In this case, we're just looking up a value.  In practice an oracle might perform docking calcuations or something else that's more compute itensive. The notebook **active_shape_search.ipynb** has a complete implementation of an oracle. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "54d69d28",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:05:06.076630Z",
     "start_time": "2025-05-05T17:05:06.073693Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "class Oracle:\n",
    "    def __init__(self, df):\n",
    "        self.df = df\n",
    "\n",
    "    def get_values(self, idx_list):\n",
    "        return df.y.values[idx_list]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "149177eb",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Select a set of molecules to build and initial model.  In this case, we'll randomly select 100 molecules and use this set of 100 to build an ML model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "5bd48a5b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:05:08.906082Z",
     "start_time": "2025-05-05T17:05:08.896183Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "n_initial = 100\n",
    "initial_list = np.random.choice(range(len(df)), size=n_initial, replace=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d3cfe78",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Select 100 examples to build the initial model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "475cdb17",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:05:19.585701Z",
     "start_time": "2025-05-05T17:05:19.583378Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "X_training = X_pool[initial_list]\n",
    "y_training = y_pool[initial_list]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c00dc9",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Active Learning\n",
    "Define an ActiveLearner.  This class holds the ML model used to preform the active learning. In this case we'll use a RandomForestClassifier from scikit_learn as our ML model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e6451973",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:05:23.968289Z",
     "start_time": "2025-05-05T17:05:23.909524Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "learner = ActiveLearner(\n",
    "    estimator=RandomForestClassifier(),\n",
    "    X_training=X_training, y_training=y_training\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c71b5e2d",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Instantiate an oracle.  As mentioned above, this is a simple lookup. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "bb44c6cf",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T17:05:26.378552Z",
     "start_time": "2025-05-05T17:05:26.376411Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "oracle = Oracle(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07155674",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Run 10 cycles of active learning. We'll print out the number of active molecules we've found at each iteration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "270d51da",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:49:50.329347Z",
     "start_time": "2025-05-05T21:49:11.804079Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 15\n",
      "1 46\n",
      "2 59\n",
      "3 110\n",
      "4 128\n",
      "5 149\n",
      "6 184\n",
      "7 225\n",
      "8 276\n",
      "9 302\n"
     ]
    }
   ],
   "source": [
    "# Define a list to keep track of the molecules we've selected\n",
    "pick_list = initial_list\n",
    "# How many molecules we will select at each iteration\n",
    "n_instances = 100\n",
    "for i in range(0, 10):\n",
    "    with warnings.catch_warnings():\n",
    "        warnings.simplefilter(\"ignore\",category=FutureWarning)\n",
    "        # Use the model to select the next set of molecules\n",
    "        query_idx, query_inst = learner.query(X_pool, n_instances=n_instances)\n",
    "        # Use the oracle to look up the value\n",
    "        y_new = oracle.get_values(query_idx)\n",
    "        # Use the values from the oracle to update the model\n",
    "        learner.teach(X_pool[query_idx], y_new)\n",
    "        # Add the picks to pick_list\n",
    "        pick_list = np.append(pick_list, query_idx)\n",
    "        # How many active molecules have we found\n",
    "        print(i,sum(y_pool[pick_list]))"
   ]
  },
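  {
   "cell_type": "markdown",
   "id": "a7c3e1f0",
   "metadata": {},
   "source": [
    "One caveat, stated as an assumption about modAL rather than something this notebook verifies: the learner does not appear to track which pool entries it has already been taught, so passing the full `X_pool` on every cycle leaves open the possibility of selecting the same molecule twice. A minimal NumPy sketch of one way to guard against this is to query only the unlabeled remainder and map the result back to pool indices. Here `choose` is a hypothetical stand-in for `learner.query`, and `X_demo` is random data.\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def choose(X_subset, n_instances):\n",
    "    # Hypothetical stand-in for learner.query: rows with the largest first feature\n",
    "    return np.argsort(X_subset[:, 0])[-n_instances:]\n",
    "\n",
    "X_demo = np.random.default_rng(0).random((1000, 8))\n",
    "pick_list = np.array([3, 14, 15])                       # indices already labeled\n",
    "remaining = np.setdiff1d(np.arange(len(X_demo)), pick_list)\n",
    "local_idx = choose(X_demo[remaining], n_instances=5)    # indices into the subset\n",
    "query_idx = remaining[local_idx]                        # indices into the full pool\n",
    "pick_list = np.append(pick_list, query_idx)\n",
    "assert len(set(pick_list)) == len(pick_list)            # no molecule picked twice\n",
    "```\n",
    "The design choice is to keep all bookkeeping in full-pool indices, so the oracle and the hit count can keep using the same index space as before."
   ]
  },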
  {
   "cell_type": "markdown",
   "id": "11749652",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Compare With a Random Baseline\n",
    "That looks pretty good, but we should compare with a random baseline.  Let's select 1,000 random molecules and see how many actives we find."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5b7107fb",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:50:02.229643Z",
     "start_time": "2025-05-05T21:50:02.177264Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "random_hit_count_list = []\n",
    "for i in range(0, 10):\n",
    "    random_list = np.random.choice(range(len(df)), size=1000, replace=False)\n",
    "    random_hit_count_list.append(sum(df.y.values[random_list]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "123e60c8",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Look at the number of active molecules we found with a random search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0d0a669f",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:50:05.408140Z",
     "start_time": "2025-05-05T21:50:05.405145Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[44, 49, 57, 50, 60, 59, 56, 52, 47, 46]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_hit_count_list"
   ]
  },
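  {
   "cell_type": "markdown",
   "id": "c9d2b4e6",
   "metadata": {},
   "source": [
    "As a sanity check on these counts, the number of actives in a random draw without replacement follows a hypergeometric distribution. A minimal sketch of the expected value and spread, using the class counts shown earlier (4,986 actives among 100,000 molecules); the variable names are illustrative.\n",
    "```python\n",
    "import math\n",
    "\n",
    "n_pool, n_active, n_draw = 100_000, 4_986, 1_000\n",
    "p = n_active / n_pool\n",
    "# Mean and standard deviation of the hit count when sampling without replacement\n",
    "expected_hits = n_draw * p\n",
    "std_hits = math.sqrt(n_draw * p * (1 - p) * (n_pool - n_draw) / (n_pool - 1))\n",
    "print(f'expected {expected_hits:.1f} +/- {std_hits:.1f} hits')\n",
    "```\n",
    "This works out to about 50 hits with a standard deviation of about 7, which is consistent with the range of random counts above."
   ]
  },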
  {
   "cell_type": "markdown",
   "id": "70bf16e1",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Now lets run 10 active learning trials.  To do this, we'll write a function that encapsulates the active learning code we wrote above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "4cb202f9",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:55:28.459892Z",
     "start_time": "2025-05-05T21:55:28.456462Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "def run_active_learning(X, y, oracle, num_cycles=10):\n",
    "    initial_list = np.random.choice(range(len(df)), size=n_initial, replace=False)\n",
    "    pick_list = initial_list\n",
    "    learner = ActiveLearner(\n",
    "        estimator=RandomForestClassifier(),\n",
    "        X_training=X_training, y_training=y_training\n",
    "    )\n",
    "    for i in tqdm(range(0, num_cycles)):\n",
    "        query_idx, query_inst = learner.query(X_pool, n_instances=n_instances)\n",
    "        y_new = oracle.get_values(query_idx)\n",
    "        learner.teach(X_pool[query_idx],y_new)\n",
    "        pick_list = np.append(pick_list, query_idx)\n",
    "    return sum(y[pick_list])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "a5461e15",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:57:28.581129Z",
     "start_time": "2025-05-05T21:55:29.673009Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "66d1742f2b434c9898ff8748b96ce3c9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3a6e102f0d824c6ea8e7f73ff809a7cb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "910d49280d6a4779a727c471489e9c6d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "18802a4a49204fba8d9472c9bad4398a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "abf495df606843a8be56eada5cb1e3f4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "018471d351d94133bc9330f18cbf36d8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "139c5f27047845b381e5ff8a346d1e9d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "722763328a494ebc850ff1a2d63555d9",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c2a4b9a8cd614c75936e66949cefc669",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "49d77ac4bbdd471eb70055d2fc5fd624",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/10 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "active_learning_hit_count_list = []\n",
    "for i in range(0, 10):\n",
    "    with warnings.catch_warnings():\n",
    "        warnings.simplefilter(\"ignore\",category=FutureWarning)\n",
    "        num_hits = run_active_learning(X_pool, y_pool, oracle)\n",
    "        active_learning_hit_count_list.append(num_hits)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e5291b0",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Look at the number of hits we found with active learning.  Note that it's a lot more than what we found with a random search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "6dfff580",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:57:33.691070Z",
     "start_time": "2025-05-05T21:57:33.688474Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[326, 244, 268, 307, 258, 232, 283, 299, 290, 207]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "active_learning_hit_count_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7740a196",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Let's make a boxplot to compare the random and active learning searches.   This will be a lot easier if we put the data into a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "25f6cc01",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:57:37.258242Z",
     "start_time": "2025-05-05T21:57:37.254103Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "random_df = pd.DataFrame(random_hit_count_list)\n",
    "random_df.columns = [\"count\"]\n",
    "random_df['category'] = \"random\"\n",
    "active_df = pd.DataFrame(active_learning_hit_count_list)\n",
    "active_df.columns = [\"count\"]\n",
    "active_df['category'] = \"active learning\"\n",
    "plot_df = pd.concat([random_df, active_df])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f33f741",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Now make the boxplot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "2fa2a9ac",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:57:39.856137Z",
     "start_time": "2025-05-05T21:57:39.802165Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAGwCAYAAABPSaTdAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjcsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvTLEjVAAAAAlwSFlzAAAPYQAAD2EBqD+naQAAK2RJREFUeJzt3QucTfX+//HPXBjjMjO5DZMhuRMqxMRRuY2UEpVKUjmmhHIJZw65RSPVUTruj1ySS6nojCJMoXLNIXchGWJwFOM2Y5j9e3y+/8fe/9mug5nZa77zej4eqz17rbXXXntrr/3en+/3u5afy+VyCQAAgKX8fb0DAAAA2YmwAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgtUBf74ATpKeny8GDB6VIkSLi5+fn690BAACZoKcKPHnypERERIi//5XrN4QdERN0IiMjM/O+AgAAh9m/f7+UKVPmissJOyKmouN+s0JCQnLuXwcAANyw5ORkU6xwf49fCWFHxNN0pUGHsAMAQO5yrS4odFAGAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDWueg4AkJSUFElMTOSdcJCyZctKgQIFfL0bViDsAABM0ImJieGdcJBJkyZJ5cqVfb0bViDsAABMFUG/XHO7ffv2yYgRI2TAgAFSrlw5ye3/JsgahB0AgGkusamKoEHHpteDm0MHZQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGo+DTvjx4+XWrVqSUhIiJmioqJk4cKFnuUpKSnSrVs3KVasmBQuXFjatWsnhw8f9tpGYmKiPPTQQ1KwYEEpWbKk9O3bV86fP++DVwMAAJzIp2GnTJkyMnLkSFm/fr38/PPP0qRJE3n00Udl69atZnmvXr0kPj5e5s6dK8uXL5eDBw9K27ZtPY+/cOGCCTrnzp2TlStXyvTp02XatGkyaNAgH74qAADgJH4ul8slDlK0aFF555135PHHH5cSJUrIrFmzzN9qx44dUq1aNVm1apU0aNDAVIEefvhhE4LCw8PNOhMmTJD+/fvL0aNHJX/+/Jl6zuTkZAkNDZUTJ06YChMAIHf69ddfJSYmRiZNmiSVK1f29e4gm2X2+9sxfXa0SjNnzhw5ffq0ac7Sak9aWpo0a9bMs07VqlWlbNmyJuwova1Zs6Yn6Kjo6Gjz4t3VoctJTU0162ScAACAnXwedjZv3mz64wQFBcnLL78s8+bNk+rVq0tSUpKpzISFhXmtr8FGlym9zRh03Mvdy64kLi7OJEH3FBkZmS2vDQAA+J7Pw06VKlVk48aNsmbNGunatat06tRJtm3blq3PGRsba0pe7mn//v3Z+nwAAMB3AsXHtHpTsWJF83edOnVk3bp18sEHH0j79u1Nx+Pjx497VXd0NFapUqXM33q7du1ar+25R2u517kcrSLpBAAA7OfzsHOx9PR006dGg0++fPkkISHBDDlXO3fuNEPNtU+P0tsRI0bIkSNHzLBztWTJEtNJSZvCACAn6I8srRLD9/bt2+d1C9/SriIXdzfJc2FHm5MefPBB0+n45MmTZuTVsmXL5NtvvzVvUOfOnaV3795mhJYGmB49epiAoyOxVIsWLUyo6dixo4waNcr00xk4cKA5Nw+VGwA5FXSe7ficpJ1L5Q13EP0hDN/Llz9IPpnxsc8Dj0/DjlZknnvuOTl06JAJN3qCQQ06zZs3N8tHjx4t/v7+prKj1R4daTVu3DjP4wMCAmTBggWmr4+GoEKF
Cpk+P8OGDfPhqwKQl2hFR4PO2dvvk/QCob7eHcAx/FNOiPy23HxG8nTY+eijj666vECBAjJ27FgzXUm5cuXkm2++yYa9A4DM06CTXqg4bxngQD4fjQUAAJCdCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYLdDXOwAANvA/e9zXuwA4ipM+E4QdAMgCwXtX8D4CDkXYAYAscLZ8Y0kPDuO9BDJUdpzyI4CwAwBZQINOeqHivJeAA9FBGQAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsJpPw05cXJzUq1dPihQpIiVLlpQ2bdrIzp07vda5//77xc/Pz2t6+eWXvdZJTEyUhx56SAoWLGi207dvXzl//nwOvxoAAOBEgb588uXLl0u3bt1M4NFw8s9//lNatGgh27Ztk0KFCnnW69KliwwbNsxzX0ON24ULF0zQKVWqlKxcuVIOHTokzz33nOTLl0/eeuutHH9NAADAWXwadhYtWuR1f9q0aaYys379emncuLFXuNEwczmLFy824Wjp0qUSHh4ud955p7z55pvSv39/GTJkiOTPn/+Sx6SmpprJLTk5OUtfFwAAcA5H9dk5ceKEuS1atKjX/JkzZ0rx4sXljjvukNjYWDlz5oxn2apVq6RmzZom6LhFR0ebALN169YrNp+FhoZ6psjIyGx7TQAAIA9XdjJKT0+Xnj17SsOGDU2ocXvmmWekXLlyEhERIZs2bTIVG+3X8+WXX5rlSUlJXkFHue/rssvRwNS7d2/PfQ1GBB4AAOzkmLCjfXe2bNkiP/74o9f8mJgYz99awSldurQ0bdpU9uzZIxUqVLih5woKCjITAGQV/5T/V5kG4LzPhCPCTvfu3WXBggWyYsUKKVOmzFXXrV+/vrndvXu3CTval2ft2rVe6xw+fNjcXqmfDwBkFW0Kz5c/SOS35bypwEX0s6GfkTwddlwul/To0UPmzZsny5Ytk/Lly1/zMRs3bjS3WuFRUVFRMmLECDly5Ijp3KyWLFkiISEhUr169Wx+BQDyOm02/2TGx54+h/Ctffv2me+EAQMGmC4Q8C0NOhd3NclzYUebrmbNmiVfffWVOdeOu4+NvjnBwcGmqUqXt2rVSooVK2b67PTq1cuM1KpVq5ZZV4eqa6jp2LGjjBo1ymxj4MCBZts0VQHICXowd8IBHf+fBp3KlSvzlsD3o7HGjx9vfg3piQO1UuOePv30U7Nch43rkHINNFWrVpU+ffpIu3btJD4+3rONgIAA0wSmt1rlefbZZ815djKelwcAAORdPm/GuhodIaUnHsxMgv/mm2+ycM8AAIAtHHWeHQAAgKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrBfp6BwAAvpeSkiKJiYmS2+3bt8/rNjcrW7asFChQwNe7YQXCDgDABJ2YmBhr3okRI0ZI
bjdp0iSpXLmyr3fDCoQdAICpIuiXK5z1b4KsQdgBAJjmEqoIsBUdlAEAgNV8Gnbi4uKkXr16UqRIESlZsqS0adNGdu7ceUmnuW7dukmxYsWkcOHC0q5dOzl8+PAlbc0PPfSQFCxY0Gynb9++cv78+Rx+NQAAwIl8GnaWL19ugszq1atlyZIlkpaWJi1atJDTp0971unVq5fEx8fL3LlzzfoHDx6Utm3bepZfuHDBBJ1z587JypUrZfr06TJt2jQZNGiQj14VAABwEj+Xy+UShzh69KipzGioady4sZw4cUJKlCghs2bNkscff9yss2PHDqlWrZqsWrVKGjRoIAsXLpSHH37YhKDw8HCzzoQJE6R///5me/nz57/m8yYnJ0toaKh5vpCQkGx/nQAA4OZl9vvbUX12dGdV0aJFze369etNtadZs2aedapWrWp6qGvYUXpbs2ZNT9BR0dHR5g3YunXrZZ8nNTXVLM84AQAAOzkm7KSnp0vPnj2lYcOGcscdd5h5SUlJpjITFhbmta4GG13mXidj0HEvdy+7Ul8hTYLuKTIyMpteFQAAyJVhp0mTJnL8+PFL5muFRJfdCO27s2XLFpkzZ45kt9jYWFNFck/79+/P9ucEAAC56Dw7y5YtMx2CL6Yjp3744Yfr3l737t1lwYIFsmLFCilTpoxnfqlSpczzaLDKWN3R0Vi6zL3O2rVrvbbnHq3lXudiQUFBZgIAAPa7rrCzadMmz9/btm3zaibSUVGLFi2SW2+9NdPb077RPXr0kHnz5pkAVb58ea/lderUkXz58klCQoIZcq50aLoONY+KijL39VZPC37kyBHTuVnpyC7tqFS9evXreXkAACCvh50777xT/Pz8zHS55qrg4GD58MMPr6vpSkdaffXVV+ZcO+7wpP1odFt627lzZ+ndu7fptKwBRsORBhwdiaV0qLqGmo4dO8qoUaPMNgYOHGi2TfUGAABc19BzvYqsrn777bebpiMdFu6mHYm1shIQEJDpd1VD0+VMnTpVnn/+eU/TWJ8+fWT27NlmFJWOtBo3bpxXE5XuV9euXU11qFChQtKpUycZOXKkBAZmLssx9BwAgNwns9/fjjrPjq8QdgAAsPf7+4YvBLpr1y75/vvvTV8ZHTaeEWcvBgAATnFDYWfy5Mmm2ah48eKmOSljc5T+TdgBAAC5OuwMHz7cjIDSSzIAAABYd1LBv/76S5544oms3xsAAAAnhB0NOosXL87qfQEAAHBGM1bFihXljTfekNWrV5uLcOqJ/zJ69dVXs2r/AAAAbsoNDT2/+EzHXhv085PffvtNchOGngMAkPtk69DzvXv33sy+AQAAOLvPDgAAQG5xQ5WdF1988arLp0yZcqP7AwAA4Puwo0PPM0pLS5MtW7bI8ePHL3uBUAAAgFwVdubNm3fJPL1khJ5VuUKFClmxXwAAAM7qs+Pv7y+9e/eW0aNHZ9UmAQAAnNVBec+ePXL+/Pms3CQAAEDON2NpBScjPVXPoUOH5Ouvv5ZOnTrd3B4BAAD4Ouxs2LDhkiasEiVKyHvvvXfNkVoAAACODzvff/991u8JAACAU8KO29GjR2Xnzp3m7ypVqpjqDgAAQK7voHz69GnTXFW6dGlp3LixmSIiIqRz585y5syZrN9LAACAnAw72kF5+fLlEh8fb04kqNNXX31l5vXp0+dG9wUAAMAZVz0vXry4fP7553L//fdf0pfnySefNM1buQlXPQcAIPfJ7Pf3DVV2tKkqPDz8kvklS5akGQsAADjKDYWdqKgoGTx4sKSkpHjmnT17VoYOHWqWAQAA5OrRWO+//760bNlSypQpI7Vr1zbzfvnlFwkKCpLFixdn9T4CAADkbJ8dd1PWzJkzZceOHeZ+tWrVpEOHDhIcHCy5DX12AAAQa7+/b6iyExcXZ/rsdOnSxWv+lClTTOfk/v3738hmAQAAnNFnZ+LEiVK1atVL5teoUUMmTJiQFfsFAADgu7CTlJRkTih4
MT2Dsl4QFAAAIFeHncjISPnpp58uma/z9EzKAAAATnFDfXa0r07Pnj0lLS1NmjRpYuYlJCRIv379OIMyAADI/WGnb9++cuzYMXnllVfk3LlzZl6BAgVMx+TY2Nis3kcAAICcH3quTp06Jdu3bzfDzStVqmTOs5MbMfQcAIDcJ1uHnrsVLlxY6tWrdzObAAAAcF4HZQAAgNyCsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwmk/DzooVK6R169YSEREhfn5+Mn/+fK/lzz//vJmfcWrZsqXXOn/++ad06NBBQkJCJCwsTDp37iynTp3K4VcCAACcyqdh5/Tp01K7dm0ZO3bsFdfRcHPo0CHPNHv2bK/lGnS2bt0qS5YskQULFpgAFRMTkwN7DwAAcoNAXz75gw8+aKarCQoKklKlSl122fbt22XRokWybt06qVu3rpn34YcfSqtWreTdd981FSMAAJC3Ob7PzrJly6RkyZJSpUoV6dq1qxw7dsyzbNWqVabpyh10VLNmzcTf31/WrFlzxW2mpqZKcnKy1wQAAOzk6LCjTVgff/yxJCQkyNtvvy3Lly83laALFy6Y5UlJSSYIZRQYGChFixY1y64kLi5OQkNDPVNkZGS2vxYAAJAHm7Gu5amnnvL8XbNmTalVq5ZUqFDBVHuaNm16w9uNjY2V3r17e+5rZYfAAwCAnRxd2bnY7bffLsWLF5fdu3eb+9qX58iRI17rnD9/3ozQulI/H3c/IB29lXECAAB2ylVh58CBA6bPTunSpc39qKgoOX78uKxfv96zznfffSfp6elSv359H+4pAABwCp82Y+n5cNxVGrV3717ZuHGj6XOj09ChQ6Vdu3amSrNnzx7p16+fVKxYUaKjo8361apVM/16unTpIhMmTJC0tDTp3r27af5iJBYAAFB+LpfL5au3QvvePPDAA5fM79Spk4wfP17atGkjGzZsMNUbDS8tWrSQN998U8LDwz3rapOVBpz4+HgzCkvD0ZgxY6Rw4cKZ3g/ts6MdlU+cOEGTFgAAuURmv799GnacgrADAIC939+5qs8OAADA9SLsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAVvNp2FmxYoW0bt1aIiIixM/PT+bPn++13OVyyaBBg6R06dISHBwszZo1k127dnmt8+eff0qHDh0kJCREwsLCpHPnznLq1KkcfiUAAMCpfBp2Tp8+LbVr15axY8dedvmoUaNkzJgxMmHCBFmzZo0UKlRIoqOjJSUlxbOOBp2tW7fKkiVLZMGCBSZAxcTE5OCrAAAATubn0vKJA2hlZ968edKmTRtzX3dLKz59+vSR119/3cw7ceKEhIeHy7Rp0+Spp56S7du3S/Xq1WXdunVSt25ds86iRYukVatWcuDAAfP4y0lNTTWTW3JyskRGRprta4UIAAA4n35/h4aGXvP727F9dvbu3StJSUmm6cpNX1D9+vVl1apV5r7eatOVO+goXd/f399Ugq4kLi7ObMs9adABAAB2cmzY0aCjtJKTkd53L9PbkiVLei0PDAyUokWLeta5nNjYWJMC3dP+/fuz5TUA
AADfC5Q8KCgoyEwAAMB+jq3slCpVytwePnzYa77edy/T2yNHjngtP3/+vBmh5V4HAADkbY4NO+XLlzeBJSEhwasjkvbFiYqKMvf19vjx47J+/XrPOt99952kp6ebvj0AAAA+bcbS8+Hs3r3bq1Pyxo0bTZ+bsmXLSs+ePWX48OFSqVIlE37eeOMNM8LKPWKrWrVq0rJlS+nSpYsZnp6Wlibdu3c3I7WuNBILAADkLT4NOz///LM88MADnvu9e/c2t506dTLDy/v162fOxaPnzdEKTqNGjczQ8gIFCngeM3PmTBNwmjZtakZhtWvXzpybBwAAwFHn2ckN4/QBAIBz5Prz7AAAAGQFwg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1Qg7AADAaoQdAABgNcIOAACwGmEHAABYjbADAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2wAwAArEbYAQAAViPsAAAAqxF2AACA1QJ9vQPIPVJSUiQxMdHXu4EMypYtKwUKFOA9AYCrIOwg0zToxMTE8I45yKRJk6Ry5cq+3g0AcDTCDq6riqBfrrndvn37ZMSIETJgwAApV66c5PZ/EwDA1RF2kGnaXGJTFUGDjk2vBwBweYSdHHL48GE5ceJETj0drlHZyXgL3woNDZXw8HD+GQBkGz+Xy+WSPC45OdkccDWMhISEZEvQebbjc5J2LjXLtw3kdvnyB8knMz4m8ADItu9vKjs5QP8RNOicvf0+SS8QmhNPCeQK/iknRH5bbj4jVHcAZBfCTg7SoJNeqHhOPiUAAHkeJxUEAABWc3TYGTJkiPj5+XlNVatW9TrJXbdu3aRYsWJSuHBhadeunekfAwAAkCvCjqpRo4YcOnTIM/3444+eZb169ZL4+HiZO3euLF++XA4ePCht27b16f4CAABncXyfncDAQClVqtQl87VD40cffSSzZs2SJk2amHlTp06VatWqyerVq6VBgwbiNP5nj/t6FwBH4TMBICc4Puzs2rVLIiIizAntoqKiJC4uzpw1dv369ZKWlibNmjXzrKtNXLps1apVVw07qampZso4dC0nBO9dkSPPAwAAcknYqV+/vkybNk2qVKlimrCGDh0qf/vb32TLli2SlJQk+fPnl7CwMK/H6PBVXXY1Gph0WzntbPnGkh7svb9AXq/s8CMAQJ4OOw8++KDn71q1apnwo6f4/+yzzyQ4OPiGtxsbGyu9e/f2quxERkZKdtOgw9BzAABylqPDzsW0iqPXMtq9e7c0b95czp07J8ePH/eq7uhorMv18ckoKCjITD45gRoAPhMAclSuCjunTp2SPXv2SMeOHaVOnTqSL18+SUhIMEPO1c6dOyUxMdH07XESPZW1nhJfzxQLwJt+NvQzAgB5Muy8/vrr0rp1a9N0pcPKBw8eLAEBAfL000+bg2Pnzp1Nc1TRokXNNTF69Ohhgo7TRmJpPyK99g8XAnUGvQDoiBEjZMCAAeb/LfgWFwIFkKfDzoEDB0ywOXbsmJQoUUIaNWpkhpXr32r06NHi7+9vKjs6uio6OlrGjRsnTqSBh2v/OIsGHW0WBQDYzdFhZ86cOVddrsPRx44dayZkPz1jtTYT2lDZyXibm+mpFvRzAADIpWEHzqJBJyYmRmyhTVm53aRJk6hOAcA1EHZwXVUE/XKFs/5NAABXR9hBpmlzCX1cAAC5jeMvBAoAAHAzCDsAAMBqhB0AAGA1wg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAYDXCDgAAsBphBwAAWI2rnouIy+Uyb0ZycrKv/z0AAEAmub+33d/j
V0LYEZGTJ0+aNyMyMjKz7y8AAHDQ93hoaOgVl/u5rhWH8oD09HQ5ePCgFClSRPz8/Hy9O8iBXwIabPfv3y8hISG834BF+HznLS6XywSdiIgI8fe/cs8cKjvaccnfX8qUKZOT/z5wAA06hB3ATny+847Qq1R03OigDAAArEbYAQAAViPsIM8JCgqSwYMHm1sAduHzjcuhgzIAALAalR0AAGA1wg4AALAaYQcAAFiNsIM86/nnn5c2bdr4ejcAq+mJWufPn5/rnyOz7r//funZs6evdwMX4aSCAICbNmTIEBM4Nm7c6DX/0KFDcsstt+SZd/jLL7+UfPny+Xo3cBHCDhzt3Llzkj9/fl/vBoAbVKpUKSveu7S0tEyFmKJFi+bI/uD60IwFR9EScPfu3U0ZuHjx4hIdHS3/+te/pGbNmlKoUCFzTatXXnlFTp065XnMtGnTJCwsTL799lupVq2aFC5cWFq2bGl+UbpduHBBevfubdYrVqyY9OvX75Kr5Kampsqrr74qJUuWlAIFCkijRo1k3bp1nuXLli0z5XJ9nrvuukuCg4OlSZMmcuTIEVm4cKF5bj1F/TPPPCNnzpzJoXcMyBqLFi0y/8+7PyMPP/yw7Nmzx2udAwcOyNNPP22+0PXzWLduXVmzZo35DA4dOlR++eUX8xnRSedd3MR07733Sv/+/b22efToURMiVqxY4fkcvv7663Lrrbea56hfv7757F0Pve7dk08+aV6L7uujjz4qv//+u2e5fq6bN29ujjF6qYH77rtP/vvf/3ptQ/d7/Pjx8sgjj5j9GDFihKle3XnnnTJjxgy57bbbzGOfeuopz8WkL9eMpeu99dZb8uKLL5rrL5YtW1YmTZrk9VwrV64029Xjjr6n+n7p819cJcONI+zAcaZPn26qOT/99JNMmDDBXLtszJgxsnXrVrPsu+++M2ElIw0X7777rjkI6UEzMTHRHDDd3nvvPXPwnTJlivz444/y559/yrx587y2odv84osvzHPoga9ixYombOm6GekB79///rc5QLkPqu+//77MmjVLvv76a1m8eLF8+OGH2fwuAVnr9OnT5gfBzz//LAkJCeZz99hjj5kLJSv9gaGh4I8//pD//Oc/JtjoZ0aXt2/fXvr06SM1atQwPzJ00nkX69Chg8yZM8frh8ann35qLuL4t7/9zdzXHzurVq0y623atEmeeOIJ8+Nl165dma7A6OdWg8UPP/xgjiPuH0BaKVYaTjp16mSOBatXr5ZKlSpJq1atvEKL+7Ou78HmzZtNWFEaADWMLFiwwEzLly+XkSNHXnWf9PijIWbDhg3mx1rXrl1l586dnguXtm7d2vyg0+POm2++eUkgRBbQq54DTnHfffe57rrrrquuM3fuXFexYsU896dOnapHTtfu3bs988aOHesKDw/33C9durRr1KhRnvtpaWmuMmXKuB599FFz/9SpU658+fK5Zs6c6Vnn3LlzroiICM/jvv/+e/M8S5cu9awTFxdn5u3Zs8cz76WXXnJFR0ffxLsA+N7Ro0fN/9ubN2829ydOnOgqUqSI69ixY5ddf/Dgwa7atWtfMl+3MW/ePPP3kSNHXIGBga4VK1Z4lkdFRbn69+9v/t63b58rICDA9ccff3hto2nTpq7Y2Ngr7mvG55gxY4arSpUqrvT0dM/y1NRUV3BwsOvbb7+97OMvXLhgXlt8fLzXNnv27HnJayxYsKArOTnZM69v376u+vXrex3DXnvtNc/9cuXKuZ599lnPfd2vkiVLusaPH2/u660ez86ePetZZ/Lkyeb5N2zYcMXXjOtDZQeOU6dOHa/7S5culaZNm5qytv5a69ixoxw7dsyrqahgwYJSoUIFz/3SpUub5iV14sQJ80tTy+FugYGB5peWm/5a01+EDRs29MzT0vo999wj27dv99qfWrVqef4ODw83z3377bd7zXM/N5BbaOVEm6j0/2VtjtXmF6VVUqVNKtp8ezN9UkqUKCEt
WrSQmTNnmvt79+41VRyt+CitoGiTc+XKlU01xj1p9eTiJrUr0YrT7t27zbHC/Xjd55SUFM82Dh8+LF26dDEVHW2K0terlSv3a3XLeIxw0/dFt325Y82VZDxmaPOU9mNyP0YrPLpcm7Dc9LiDrEUHZTiOto+7aTu79h3Qsq+2metBS0vPnTt3NiVpDRrq4o6DekC5uE9OVsn4XPo8l3tud+kfyC20KaVcuXIyefJk06yk/w/fcccdnqYf7aOWFTTYaN84berVpl9tvtFJaeAICAiQ9evXm9uMNLRkhm5DfzC5A9XFYUtpE5b+YPrggw/Ma9braUVFRXle6+WORW438nnnGOF7VHbgaHrQ0wOJtnk3aNDA/OI7ePDgdW1Df7npry/tSOl2/vx5s203rQq5+wm5aaVHOzJWr149i14N4Ez6xa8VhoEDB5oqqna2/+uvv7zW0eqDVncu7sPmpp8frcpci3YW1iqLdojWsOOu6iitHOk2tOqhfeYyTpkd1XX33XebKpUONLh4G3osUPo518Cl/XS0n5GGnf/973/iC1WqVDEVLe2Y7ZZxYASyBmEHjqYHKA0d+ivwt99+Mx2QtdPy9XrttddMJ0LtWLhjxw7TSfD48eNev+C0etS3b19zEN62bZspc2tTmVaRAJvpeXB0BJaOEtImIB0EoJ2VM9ImLg0ceiJODQv6edQO/doM5W7e0WYpDUQaHDJ+eWeknzXdxhtvvGGaiHW7bvpjRsPPc889Z85Xo9tbu3atxMXFmc7/maGP11FWGqq0g7JuQ0dzabjR0WRKm6/0WKLPrz+C9DFZVbm6Xjp6U3/QxcTEmP3R0Z462MJdNULWIOzA0WrXrm2Gnr/99tumpK6laT3wXS8dKaJ9fbR8reVqbXPXURYZaRhq166dWU9/HepBXw88eemEaMibdOSVjn7Saqd+znr16iXvvPPOJZUbHWmoFROtiGjTk35m3M1N+tnREU8PPPCAaS6aPXv2FZ9Pw4X2rdERWDoUO6OpU6easKOfWa16aDDSSsfF612JNm3riExdv23btqZKpT9YtJqkfXPURx99ZCpX+jnXz7v7lBO+oPsUHx9vQqIOPx8wYIAMGjTILMvYjwc3x097Kd/kNgAAQBbRH3UvvPCCGVzhq4qTbeigDACAD3388cdmFJyOONWKl55nR8/fRdDJOoQdAAB8KCkpyTRd6a0OptATKeroU2QdmrEAAIDV6KAMAACsRtgBAABWI+wAAACrEXYAAIDVCDsAAMBqhB0AAGA1wg6AXGPIkCHmlPoAcD0IOwBwg/QitQCcj7ADIEfpFZ5HjRplrmgfFBRkLtjoPlusniZfr3ytF3PU0+frlbHdgWLatGkydOhQczp9vRq0TjpP6RXs//73v5sLUOqFFZs0aWLWy2j48OHmYo96EVhd9x//+IdXlUj3a9iwYVKmTBmzX7ps0aJFnuW///67ec5PP/1U7rvvPnORRr1KuD7f559/7vVc8+fPN1f3PnnyZLa+lwAyh8tFAMhRsbGxMnnyZBk9erQ0atRIDh06JDt27DDLNIhogImIiJDNmzdLly5dzLx+/fpJ+/btZcuWLSaALF261KwfGhpqbvX0+nodoYULF5p5EydOlKZNm8qvv/4qRYsWNRdW1EA1btw4adiwobnC93vvvSfly5f37NcHH3xg5ulj77rrLpkyZYo88sgjsnXrVqlUqZJnPQ1Jup6uo4FHQ5Veqfvxxx/3rOO+r/sOwAH0qucAkBOSk5NdQUFBrsmTJ2dq/XfeecdVp04dz/3Bgwe7ateu7bXODz/84AoJCXGlpKR4za9QoYJr4sSJ5u/69eu7unXr5rW8YcOGXtuKiIhwjRgxwmudevXquV555RXz9969e116yHz//fe91lmzZo0rICDAdfDgQXP/8OHDrsDAQNeyZcsy9RoBZD+asQDkmO3bt0tqaqqpulyONhFp5aVUqVJSuHBhGThwoCQmJl51m1pZOXXqlBQr
Vsw8xj3t3btX9uzZY9bZuXOn3HPPPV6Py3g/OTlZDh48aJ47I72v+5xR3bp1L9lOjRo1ZPr06eb+J598IuXKlZPGjRtn6j0BkP1oxgKQY7Sp6UpWrVolHTp0MP1yoqOjTXOUu7npajTo6JWily1bdsmysLAwyWraF+di2gdo7NixpolLm7BeeOEF078HgDNQ2QGQY7TviwaehISES5atXLnSVEQGDBhgqie67r59+7zWyZ8/v1y4cMFr3t133y1JSUkSGBhoOj1nnIoXL27WqVKliqxbt87rcRnvaydj7Sf0008/ea2j96tXr37N1/Xss8+afR0zZoxs27ZNOnXqlMl3BEBOoLIDIMdoh14dcaUdjjW4aDPR0aNHPZ2AtclKqzn16tWTr7/+WubNm+f1+Ntuu800T23cuNGMmtIOwM2aNZOoqChp06aNGeWlo7m0SUof/9hjj5ng1KNHD9PZWf++9957TXPZpk2bzIgvt759+8rgwYOlQoUKZiSWVmj0ebRz87Xccsst0rZtW7ONFi1amH0D4CA50C8IADwuXLjgGj58uKtcuXKufPnyucqWLet66623zLK+ffu6ihUr5ipcuLCrffv2rtGjR7tCQ0M9j9VOyO3atXOFhYWZzsJTp071dHzu0aOH6WSs24yMjHR16NDBlZiY6HnssGHDXMWLFzfbfvHFF12vvvqqq0GDBl77NWTIENett95qtqGdlxcuXOhZ7u6gvGHDhsv+ayYkJJjln332Gf/agMP46X98HbgAIKc1b97cdISeMWNGlmxPt9OrVy9TVdKqFQDnoBkLgPXOnDkjEyZMMB2fAwICZPbs2eZcPUuWLMmSbeu5gkaOHCkvvfQSQQdwIDooA7Cejoz65ptvzHDwOnXqSHx8vHzxxRemv8/N0n5CVatWNVUiPWEiAOehGQsAAFiNyg4AALAaYQcAAFiNsAMAAKxG2AEAAFYj7AAAAKsRdgAAgNUIOwAAwGqEHQAAIDb7Pz2MlZW/CvI5AAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.boxplot(data=plot_df, x=\"category\", y=\"count\");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af77878b",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
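    "Calculate the enrichment ratio for active learning relative to random selection: the mean number of hits found by active learning divided by the mean number of hits found by random selection."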
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "e7bfd7c8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-05T21:57:42.497117Z",
     "start_time": "2025-05-05T21:57:42.494055Z"
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.2192307692307685"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.mean(active_learning_hit_count_list) / np.mean(random_hit_count_list)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
